Genetic influences on human blood metabolites in the Japanese population

Summary An increase in ethnic diversity in genetic studies has the potential to provide unprecedented insights into how genetic variations influence human phenotypes. In this study, we conducted a quantitative trait locus (QTL) analysis of 121 metabolites measured using gas chromatography-mass spectrometry with plasma samples from 4,888 Japanese individuals. We found 60 metabolite-gene associations, of which 13 have not been previously reported. Meta-analyses with another Japanese and a European study identified six and two additional unreported loci, respectively. Genetic variants influencing metabolite levels were more enriched in protein-coding regions than in the regulatory regions while being associated with the risk of various diseases. Finally, we identified a signature of strong negative selection for uric acid (Sˆ = −1.53, p = 6.2 × 10−18). Our study expanded the knowledge of genetic influences on human blood metabolites, providing valuable insights into their physiological, pathological, and selective properties.


INTRODUCTION
Genome-wide quantitative trait locus (QTL) analysis of metabolites (mQTL) in the European population (EUR) has identified hundreds of common to rare genetic variants associated with human blood metabolites. [1][2][3][4][5][6][7][8][9][10] These studies have provided heritability estimates of multiple metabolites, insights into the biochemical pathways, and downstream functional implications of disease-associated variants. However, an imbalance in the study population remains a significant limitation of these studies. As in most genome-wide association studies (GWASs), most of such studies target EURs. 11 This imbalance limits our understanding of the genetic influence on metabolites for two reasons. One reason is that the analysis of complex traits generally strongly depends on genetic architecture. 12,13 The other reason is that metabolite levels are affected by environmental factors such as diet, lifestyle, and physical activity, which are often different among ethnicities. The comparison of QTL profiles among multiple ethnicities and the meta-analysis across populations will lead to a better understanding of genetic influences on human blood metabolites. Recent large-scale GWAS of lipids has indicated that a study comprising multiple ancestries could detect candidate genes or variants more efficiently than a study comprising single ancestry. 14 Furthermore, studies on non-EURs enable us to understand the downstream effects of genetic variants associated with diseases or other traits identified in those populations. Although some studies have focused on non-EURs, such as the Latino, 15 Middle Eastern, 16 African American, 17 and Japanese, 18,19 they employed a relatively small number of participants. There is still an unmet need for large-scale studies focusing on non-EURs.
Previous studies have identified thousands of metabolite-disease associations to date 20 ; however, their relationship to fitness has not been explored. If the level of a metabolite is beneficial or detrimental to adaptation, the allele of the genetic variants influencing the metabolite level should be under selection pressure. We can infer such natural selection signatures by uncovering the genetic impacts on metabolites.
Overall, analyzing genetic influences on metabolites in the Japanese population has the potential to obtain unprecedented physiological, pathological, and selective insights into them. With these goals in mind, we conducted a GWAS of the blood levels of 121 metabolites quantified by gas chromatography-mass spectrometry (GC-MS) in a Japanese community-based cohort.
Of the 60 gene-metabolite pairs, 13 pairs consisting of 11 genes and ten metabolites were not reported ( Figure S11). Among the 11 genes, SOD3 was not reported to be associated with metabolites ( Figure S11, orange). The remaining ten genes have been reported for their association with other metabolites but not with those identified in this study ( Figure S11, green/cyan-colored). Two out of the ten genes, namely ACP1 and DCXR, were associated with ribulose levels for which no QTLs have been shown ( Figure S11 in green).
Second, to explore additional variants independently influencing the metabolite levels, we performed a conditional analysis on the variant showing the strongest association (referred to as a lead variant) at each mQTL. We identified 15 such variants for eight metabolites (p < 4.1 3 10 À10 ) (Table S3, Figure S12-S21). We assigned a gene to each of those variants and found that five genes had two or more variants associated with five metabolites (Table S3, shown in bold). The sum of the allele dosage of the lead and additional variants that increased the level of a metabolite increased the level of that metabolite (Figure S22). The proportion of variance explained by the lead and additional variants was greater than 10% in four metabolites (3-aminoisobutyric acid, fucose, mannose, and proline) (Table S1).

Fourteen gene-metabolite pairs were localized in known biochemical pathways
Of the 60 gene-metabolite pairs, 14 were mapped to known biochemical pathways (Table S4). The DCXRribulose pair, the only pair newly identified in this study, was on the pentose and glucuronate interconversion pathway. DCXR is an enzyme which reduces xylulose to xylitol in the multiple biochemical reactions toward ribulose on that pathway. Rs60208666 is located 497-bp upstream of the transcription initiation site of DCXR and is predicted to be in the promoter region. 22 The G allele of rs60208666, which was reported to increase the expression level of DCXR, 23 decreased ribulose level (beta = À0.51).

Comparison and the multi-population analysis with the EUR
To clarify the genetic background of different association signals between the European and Japanese populations, we compared the allele frequencies of the lead variants in 47 known and 13 novel mQTLs. As a result, minor allele frequencies (MAF) of novel variants in the EUR of the 1000 Genomes Project (1 KG) Phase3 were lower than those of known variants (p = 0.065, Mann-Whitney U test; Figure 1A). On the other hand, there was no difference in their MAF in the East Asian population (EAS) (p = 0.27; Figure 1B). Furthermore, we examined whether previously reported variant-metabolite associations in the EUR were observed in the Japanese population. First, we extracted 556 variant-metabolite pairs showing association in Europeans and examined their associations in Japanese (see STAR Methods) (Table S5). One hundred forty pairs showed association (p < 0.05), of which 116 showed directional consistency (82.9%, p for sign test = 1.2 3 10 À15 ). When we set the association p value lower than 5.0 3 10 À5 , 43 variants remained, and all showed directional consistency with those of Europeans. MAF of the 43 variants in 1 KG Phase3 were significantly higher than those of the remaining 513 variants in EAS (p = 0.045, Figure 1C) but not in EUR (p = 0.11; Figure 1D). In addition, we estimated the genetic correlations for 29 of the 77 metabolites measured in both Japanese and European studies. 4 Forty-eight metabolites were not included in the analysis due to their low heritability (less than 1.0 3 10 À3 ) either in Europeans or in Japanese. We found that 13 metabolites showed a significant difference in genetic effects (p < 6.5 3 10 À4 ) between the two populations ( Table 2). The 13 metabolites included tryptophan (an essential amino acid) and pyruvic acid (the end product of glycolysis), both of which are influenced by dietary habits.

Meta-analyses discovered additional gene-metabolite pairs
We performed a meta-analysis to identify more mQTLs. First, we used the summary statistics of 55 metabolites in the Tohoku Medical Megabank Organization (ToMMo), 24 a Japanese community-based cohort (Table S1). We identified 62 gene-metabolite pairs with a significant association (p < 4.1 3 10 À10 , Table S6). Of these, 36 pairs were not detected in our own dataset, and six were not reported. Ten of the 36 pairs were on the known biological pathways. Subsequently, we conducted a multipopulation meta-analysis with a European study. 4 We tested 77 metabolites that were included either in the three datasets (44 metabolites) or in the European and our own studies (33 metabolites) (Table S1). We identified 98 gene-metabolite pairs (p < 4.1 3 10 À10 , Table S7, Figures S34-S50), of which 30 were not found in either our own study or the above Japanese meta-analysis. Among the 30 gene-metabolite associations, two have not been reported previously. Eight out of them were mapped to the known biological pathways.

Genetic variants influencing metabolite levels were enriched in the coding region
We annotated the function of the variants in linkage disequilibrium (LD) (r 2 > 0.8) with 60 lead (Table 1) and 15 additional variants obtained by the conditional analysis (Table S3) in the 1 KG EAS Phase3. We found that as many as 44 variants (58.7%) had two or more proxy variants in the coding regions. To investigate whether such a high percentage of coding proxy variants exist in the genome, we conducted an enrichment analysis by a random permutation (1,000 times) of chromosome and MAF-matched 75 variants. On average, 8.2 variants (10.9%) had proxy variants in the coding regions, demonstrating a significant enrichment of 5.4-fold (p = 2.0 3 10 À3 ) ( Figure 2A, Table S8). In addition, we found enrichment in nonsynonymous (8.2-fold, p = 2.0 3 10 À3 ) ( Figure 2B) and frameshift or stop gain (10.2-fold, p = 0.02) ( Figure 2C) variants. On the other hand, there was no significant enrichment in the regulatory region ( Figure 2D). We compared the results iScience Article with clinical measurement QTL, 25 expression QTL (eQTL), 23 and disease-associated loci 26 of the Japanese population. The enrichment of coding region variants was significantly higher (5.4-fold) than that of clinical measurement QTL and disease-associated loci (2.1-and 1.9-fold, respectively) ( Figure 2, Table S8). Similarly, the nonsynonymous variants were significantly more enriched (8.2-fold) compared to clinical measurement QTL (2.9-fold). On the contrary, regulatory region variants were more enriched in clinical measurement QTL, eQTL, and disease-associated loci. Such enrichment has been reported for eQTL of non-EAS 27 and disease-associated loci. 28

mQTL was associated with multiple clinical measurements and diseases
To investigate the impact of mQTL on clinical measurements, we performed a phenome-wide association study (PheWAS) using summary statistics of a Japanese GWAS. 25 We found 138 significant mQTL-clinical measurement associations (Table S9). Associations of variants in the GCKR region were the most abundant (43 associations), followed by the ALDH2 region (40 associations). The GCKR region was reported as being associated with dietary habits 29 in the Japanese population. The metabolites associated with GCKR were mannose (a sugar monomer) and threonine (an essential amino acid). 2-hydroxyisovaleric acid and 2-oxoisocaproic acid, which were associated with ALDH2, showed significant associations with drinking iScience Article frequency (p = 2.4 3 10 À20 and 3.7 3 10 À6 , respectively) (Table S1). Because the ALDH2 genotypes are known to be associated with drinking habits in Japanese, 29 differences in alcohol consumption might explain the pleiotropic effects of ALDH2.
Associations of variants in the CPS1 region were the third most abundant (24 associations, Figure 3, Table S9). The A allele of rs1047891 in CPS1 increased creatinine, alanine aminotransferase (ALT), mean corpuscular volume (MCV), and mean corpuscular hemoglobin (MCH). In contrast, it decreased blood urea nitrogen (BUN), uric acid, and estimated glomerular filtration rate (eGFR). Furthermore, the T allele of rs72933889 was in LD with rs1047891 (r 2 = 0.79 in 1 KG EAS) and affected creatinine and ALT levels and eGFR in the same direction as the A allele of rs1047891. However, the association with BUN lost significance (p = 9.3 3 10 À5 ).
Subsequently, we conducted an association analysis on 109 of the above 138 association pairs using the phenotype information of the Nagahama Study. We confirmed significant associations of 62 pairs (p < 0.05), and all of them showed a direction consistency (Table S9).
Among the 57 mQTL-disease association pairs, the largest pair number was obtained for ALDH2 (19 pairs), followed by GCKR (17 pairs). On the other hand, gout and macular telangiectasia type 2 were associated with the largest number of mQTL (n = 6), followed by metabolic syndrome (n = 4) ( Figure 4, Table S10). ABCG2 and SLC2A9 were also identified as mQTL of uric acid, the causative agent of gout. 30 Macular telangiectasia type 2 is a rare neurodegenerative retinal disease characterized by low blood glycine and serine levels. 31 Indeed, the C allele of rs1047891 in CPS1 decreased the blood levels of these amino acids and increased disease risk ( Figure 4, Table S10).
To further investigate the association between mQTL and disease, we looked up the summary statistics of multi-population GWAS for 1,326 disease phenotypes in the UK BioBank. We found 34 mQTL-disease associations comprising six metabolites and 20 diseases (Table S11, p meta < 6.3 3 10 À7 ). The 20 diseases were classified into seven endocrine-metabolic, five digestive, and four neoplasms and dermatologic disorders. We examined whether diseases in a specific category were enriched in the 20 diseases by an  iScience Article over-representation analysis. As a result, endocrine and metabolic diseases showed significant enrichment (p = 4.0 3 10 À4 ) (Table S12).
Among the 34 mQTL-disease associations, 26 were examined in Europeans and one or more non-EURs, corresponding to 88 European/non-European/disease combinations. We compared the direction of the allelic effect in these combinations and found that 66 (75.0%) showed a direction consistency (sign test; p = 2.9 3 10 À6 ).
Selection signature analysis revealed a strong negative selection signature for uric acid We estimated the selection signatures of the genetic effects on metabolites using the BayesS model where S < 0 and S > 0 indicate the negative and positive selection, respectively. 32 We obtained a strong negative selection signature in uric acid ( b S = À1.0, p = 1.5 3 10 À11 ) and phenylalanine ( b S = À1.0, p = 1.9 3 10 À5 ) ( Figure S51A). Next, we applied the same method to assess the polygenetic effect on the selection signature after excluding the variants that strongly influenced metabolite levels. More specifically, we removed the lead mQTL variant, those located within 500 kb around it, and the variants in the CPS1, PPM1K, and DPEP1 regions that showed pleiotropic effects on three or more metabolites. Only uric acid showed a strong negative selection signature ( b S = À1.5, p = 6.2 3 10 À18 ) ( Figure 5), suggesting the presence of multiple other variants under negative selection pressure. In contrast, the polygenicity of phenylalanine increased from 3.0 3 10 À3 to 0.06 (Figures S51B and S51C).

DISCUSSION
A genome-wide mQTL analysis of 121 human blood metabolites in the Japanese population identified 60 gene-metabolite associations consisting of 46 genes and 44 metabolites. Of note, 13 genetic variants were identified for the first time. Allele frequencies of these variants were lower than those already reported in the EUR ( Figure 1A). In contrast, allele frequencies of associated variants in Europeans but not in Japanese were lower in EAS than those showing associations in both populations ( Figure 1C). The difference in statistical power due to allele frequencies might partly explain the population-specific genetic associations.
The direction of allelic effects on metabolites was consistent between the European and Japanese populations for most variants showing a strong association (Table S5). However, the allele effects on 13 metabolites estimated by genetic correlation analysis differed between the two populations ( Table 2). Differences in environmental factors, such as dietary habits, might explain these.
Of the 13 newly identified gene-metabolite association pairs, only the DCXR-ribulose pair was mapped on the single pathway (Table S4). However, we found three other cases where the gene and the metabolite are linked through another metabolite involved in two or more metabolic pathways. In the PYCRL-pipecolinic acid association, 5-aminopentaonate is on the arginine and proline metabolic pathway in which PYCRL is engaged and on the lysine degradation pathway involving pipecolinic acid (Table S13). It also intermediates the PRODH2-pipecolinic acid association (Table S13). Another example is the association between ACP1 and ribulose which is intermediated by ribulose-5P. The DPEP1-cystine association could be physiologically implicated by combining the present findings and the results of previous studies. The T allele of rs79068217 in DPEP1 decreases DPEP1 expression (Table S2). On the other hand, the expression levels of DPEP1 and SLC3A2, a cystine transporter, were reported to correlate positively in the human kidney. 33 These suggest that the T allele of rs79068217 reduces the expression level of SLC3A2, resulting in decreased cystine excretion into the urine and, thus, increased cystine levels in the blood.
Metabolite levels had stronger associations with genetic variants influencing the structure and function of gene products than with those regulating expression ( Figure 2). This result is in contrast with the findings in the majority of disease-associated variants that are located in noncoding regions. 28 However, some variants were associated with metabolite level and disease (Figure 4, Tables S10 and S11), highlighting the difference and overlap of mQTL with disease-associated loci.
The A allele of rs1047891 in CPS1 increased serine, glycine, and acetylglycine and decreased uric acid and BUN in the blood. BUN is the end product of the urea cycle of which CPS1 functions in the first step. The Selection signature analysis indicated that most human blood metabolites were unrelated to fitness. In contrast, many clinical phenotypes have been under selection. 32 The variation in the level of some metabolites affects diseases but might be tolerant of fitness. Meanwhile, uric acid showed a remarkably negative selection signature. Hypouricemia is associated with several neurological disorders, such as Parkinson disease, Alzheimer's disease, and multiple sclerosis. 34 On the other hand, hyperuricemia can also cause various diseases, such as gout or renal disorder. 34 The multidirectional effect of uric acid on human health might explain its strong negative selection signature. These suggest that uric acid can serve as an essential biomarker of such diseases, affecting the fitness of the Japanese population.
The knowledge of genetic influences on human blood metabolites has been expanded by this study, providing valuable insights into their physiological, pathological, and selective properties. Our results demonstrate the importance of studying genetic effects on the metabolites in diverse ancestries.

Limitations of the study
This study has the limitation of a relatively small sample size compared to previous European studies. Approximately 75% of variant-metabolite associations detected in Europeans did not show associations in our dataset (p > 0.05, Table S5), which could be attributed to a low statistical power due to the sample size. The second limitation is that we selected 121 metabolites for mQTL analysis, which is only a part of those studied to date. A more comprehensive examination with a larger number of target metabolites iScience Article would allow more accurate comparisons of mQTL among ethnicities. The third point is that we used only GC-MS to quantify metabolites, and no verification was performed using other technologies. Each technology has its advantages and disadvantages, and measurements by other platforms may reveal new associations not identified in this study.

STAR+METHODS
Detailed methods are provided in the online version of this paper and include the following:

ACKNOWLEDGMENTS
We are grateful to the participants of the Nagahama Study for their invaluable contributions. We acknowledge UK Biobank for making GWAS summary data available. The study was partly supported by the Prac-  d Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

Study population
Plasma and DNA samples were obtained from participants who had taken an extensive health check between 2013 and 2015 in the Nagahama Prospective Genome Cohort for Comprehensive Human Bioscience (the Nagahama Study). 21 We measured the blood metabolite levels of 8,270 participants and genotyped 7,040 participants. After quality control, 4,888 samples with the age range of 35 and 81 years (mean, 59.0 years) comprising 67.9% females were used for the mQTL analysis. All participants were fully informed of the purpose and procedures of this study, and written consent was obtained from each participant.

Metabolite measurement
We collected blood samples (5 mL) of participants who fasted overnight from forearm veins into tubes containing ethylenediaminetetraacetic acid (EDTA; Termo, Tokyo, Japan). We performed sample preparation and GC-MS analysis in the following steps, as described in our previous study. 42 The internal standard solution (2-isopropylmalic acid, 0.1 mg/mL in purified water) and extraction solvent (methanol: water: chloroform = 2.5:1:1) were mixed at a ratio of 6:250, and added to 50 mL of each plasma sample. The resulting solution was mixed using a shaker at 1,200 rpm for 30 min at 37 C. After centrifugation at 16,000 3 g for 5 min at 4 C, 150 mL of the supernatant was collected and mixed with 140 mL of purified water. The solution was thoroughly mixed and centrifuged at 16,000 3g for 5 min at 4 C. Finally, 180 mL of the supernatant was collected and lyophilized. The lyophilized sample was dissolved in 80 mL of methoxyamine solution (20 mg/ mL in pyridine) and agitated at 1,200 rpm for 30 min at 37 C. We added 40 mL of N-methyl-Ntrimethylsilyltrifluoroacetamide solution (GL science, Tokyo, Japan) for trimethylsilyl derivatization, followed by agitation at 1,200 rpm for 30 min at 37 C. After centrifugation at 16,000 3 g for 5 min at room temperature, 50 mL of the supernatant was transferred to a glass vial. We performed GC-MS analysis using a GCMS-QP2010 Ultra (Shimadzu Corp.). The derivatized metabolites were separated on a DB-5 column (30 m 3 0.25 mm id, film thickness 1.0 mm) (Agilent Technologies, Palo Alto, CA). Helium was used as the carrier gas at a flow rate of 39 cm/s. The inlet temperature was 280 C. The column temperature was first held at 80 C for 2 min, then raised at a rate of 15 C/min to 330 C, and held for 6 min. One microliter of the sample was injected into the GC-MS in split mode (split ratio 1:3). The mass conditions were as follows: electron ionization mode with an ionization voltage of 70 eV, ion source temperature of 200 C, interface temperature of 250 C, full scan mode in the range of m/z 85-500, scan rate: 0.3 s/scan. Data acquisition and peak processing were performed using GCMS solution software version 2.71 (Shimadzu, Kyoto, Japan).
We identified low-molecular-weight metabolites as described previously. 42 Chromatographic peaks were identified by comparing their mass spectral patterns to those in the NIST library or Shimadzu GC/MS Metabolite Database Ver. 1. The identification of metabolites was further confirmed through the coincidence of retention indices in samples with those in the corresponding authentic standards. Retention indices were determined and calibrated daily by measuring the n-alkane mixture (C8-40) (Restek, Tokyo, Japan), which was run at the beginning of the batch analysis. We quantified each metabolite peak using the area under the curve and then normalized using an internal standard. iScience 26, 105738, January 20, 2023 iScience Article 0.94, respectively, Figures S52B and S52C) measured by clinical laboratory test using 4,888 samples, validating the accuracy of our measurement.
We identified 127 metabolites with known chemical structures in 8,270 samples. Among them, four metabolites were excluded because they were detected in water samples, and two were excluded due to the high relative standard deviation (>1). The median call rate of the remaining 121 metabolites was 99.99 (ranged from 48.0 to 100.0) %. Detailed information for each metabolite, including the biochemical name and class based on its chemical structure, is provided in Table S1. We then conducted a principal component (PC) analysis using the 121 quantified metabolite data of 8,270 samples. We removed four samples as outliers (from À103IQR (interquartile range) below the 25th percentile to 103IQR above the 75th percentile in one of the top two inferred axes of variation). To normalize measurement variations caused by inter-day instrument tuning differences, the medians of each run were aligned to 1.0 4,43 , and the proportion of other values was taken. Normalization effects on the machines were visually confirmed ( Figure S53). Moreover, PC was significantly associated with measurement dates and instruments before normalization (PC1: p < 1.1 3 10 À16 , PC2: p < 1.1 3 10 À16 , one-way ANOVA) but not after (PC1: p = 1.0, PC2: p = 1.0).

Genotyping and imputation
Samples were genotyped using Illumina Human610-Quad (n = 1,735), HumanOmni2.5 (n = 1,941), HumanCoreExome-24 (n = 1,721), or Asian Screening Array (n = 1,643). The alleles of all datasets were aligned to GRCh37. After initial quality control of samples and genotypes, we performed the pre-phasing of autosomes and X chromosomes using SHAPEIT ver. 2.837 to treat the male and female samples. Prephased autosomes were imputed into the 1000 Genomes Phase3 v5 reference panel 44 with minimac3 (ver. 1.0.14). 37 For the X chromosome, we performed the imputation independently for males and females except for the pseudo-autosomal region (from 10,001 bp to 2.6 Mb, from 154.9 Mb to 156 Mb). After imputation, SNPs with MAF smaller than 0.005 or imputation quality R 2 smaller than 0.3 were removed. Finally, 10,491,983 SNPs included in any of the four datasets were used for the association study. Details of the quality control of samples and genotypes are shown in Figure S54.

Association analysis and meta-analysis
In the association analysis, we excluded outliers (from À3 3 IQR below the 25 th percentile to 3 3 IQR above the 75 th percentile) for each metabolite. We calculated residuals for the quantitative metabolite trait by linear regression analysis using age, sex, and top ten PCs as covariates in each group genotyped using four typing kits. Then, a rank-based inverse normal transformation was applied to the estimated residuals. We assumed an additive genetic model and carried out an association test on imputed allelic dosages for these residuals by a linear regression model using PLINK (v2.00) 45 in each SNP array. For the X chromosome, we conducted GWAS separately for males and females and merged their results by inverse-variance meta-analysis. We combined the association results of four arrays using inverse-variance meta-analysis based on effect size estimates and standard errors, using METAL software (released March 25, 2011). 39 We set the genome-wide significance threshold at p = 4.1 3 10 À10 ; 5.0 3 10 À8 /121 (Bonferroni correction for the 121 metabolites).

Locus definition, putative causal gene assignment, and conditional analysis
For each metabolite, we combined variants with significant associations (p < 4.1 3 10 À10 ) located within 500 kb and obtained independently associated loci at least 500-kb apart from each other. We refer to such independent associated loci for each metabolite as 'mQTL' (i.e., these could overlap other mQTL). We determined the lead variant with a minimal p value in each mQTL. For each mQTL, we assigned the most likely causal genes by PRoGeM. 46 For this purpose, two databases for cis-eQTL were used. First, we used the significant eQTL (p < 5.0 3 10 À8 ) data of peripheral blood in the Japanese population, 23 and selected proxy variants (r 2 > 0.8) from the locus of the lead variant in the EAS population of 1KG. We also used the significant cis-QTL prepared in the Genotype-Tissue Expression (GTEx) project (v7) across all tissues assayed (n = 48), and selected proxies (r 2 > 0.8) of the lead variant from the EUR and the EAS in 1KG. We used the output of ''co-occurring'' candidates. When there were more than one candidate genes, the nearest gene from the lead variant among the candidate genes was assigned, and when there was no candidate genes, the nearest gene from the lead variant was assigned.

OPEN ACCESS
iScience 26, 105738, January 20, 2023 iScience Article To identify a putative causal variant in each mQTL, we sought variants among proxies that change (i) the amino acid residue (r 2 > 0.8 1KG EAS) or (ii) the expression level of the mapped gene. ANNOVAR 40 was used to obtain functional information, and the same eQTL database and r 2 threshold used for gene assignment were employed to obtain expression information. When multiple variants passed the above criteria, we sorted the variants in the order (i) to (ii) to select one variant. When multiple variants were within the same criteria, we chose the variant with the highest LD with the lead SNP in 1KG EAS.
To assess the biological plausibility between the metabolite and the assigned gene, we used the pathway information recorded in the KEGG database 47 (http://rest.kegg.jp) released in July 2020. We excluded ''Metabolic pathways'' (map01100) which included more than 1,000 genes.
A stepwise conditional analysis was performed in the region within 1Mb to the lead variant using the genomic dosage of the top associated variants as a covariate. The stepwise analysis was repeated until no SNPs with significant associations (p < 4.1 3 10 À10 ) appeared. The putative causal gene for additional variants was assigned in the same way as described above. The variance of metabolites explained by genome-wide significant SNPs was calculated by determining the coefficient of determination (R 2 ) in the multiple linear regression model, in which the dosage of the lead and additional SNPs were set as explanatory variables and the level of the metabolite (age, sex, and top 10PC-adjusted) was selected as the objective variable. The analysis was conducted using the statsmodel package in Python (ver. 2.7.5).

Novel locus identification
A detected mQTL was considered novel when its lead variant was located at least 500-kb away from any variants whose association with the corresponding metabolite has been reported. We collected known variant-metabolite associations from the following three resources. The first one is a summary of 40 GWAS for metabolites. 11 This summary lists SNP-metabolite associations (p < 5.0 3 10 À8 ) from 40 metabolite GWAS published from November 2008 to October 2018. We matched the metabolites of our study with this summary by using HMDB identifiers. 20 The second one is the GWAS catalog, 48 released September 24, 2021. We first converted chromosomal positions from hg38 to hg19 using the liftOver tool. 49 Then, we collected metabolite-SNP associations for each metabolite by searching its standard and alternate names registered in the HMDB database. 20 The third resource is the manually collected papers that were not included in either (i) or (ii). 10,15,16,19,43,[50][51][52][53] Comparison with European studies To compare our results with those in the European population, we extracted blood metabolite-variant associations detected in the European population from the following three resources. The first one is the summary of 40 metabolite GWAS 11 described above. From this summary, we selected metabolites whose HMDB identifiers 20 were given, excluded three studies 3,9,54 which partially or totally lacked information on the direction of allelic effects, selected variants detected in both our study and previous studies, and selected variant-blood metabolite associations. The number of studies and associations in each step are shown in Figure S55. The second resource is the GWAS Catalog (released September 24, 2021) 48 . We chose SNPs that have genome-wide associations (p < 5.0 3 10 À8 ) with the trait. Chromosomal positions were converted from hg38 to hg19 using the liftOver tool 49 . For each metabolite, we searched variant-metabolite associations by searching standard and alternate names registered in the HMDB database 20 . We further excluded studies that do not include the European population. From these associations, we excluded the association included in the first resource, excluded studies that partially or totally lacked information on the direction of allelic effects, selected variants detected in our study and previous studies, and selected variant-blood metabolite associations. The number of studies and associations in each step are shown in Figure S56. The third resource is variant-metabolite associations reported in papers concerning GWAS for human blood metabolites conducted in the European population, published after October 2018 until January 2022, and have not been included in either the first or second resource. 10 Finally, we combined the results of the three resources. If multiple SNPs existed within 500 kb distance for the same metabolite, the SNP having the lowest p-value among them was selected.

Multi-population correlation estimates
We used Popcorn (version 0.9.9) 41 to estimate the correlation coefficient for the per-allele effect size r ge between the current study and the European population 4 for 77 metabolites measured in both studies. We used the LD scores and summary statistics of variants present in the HapMap3 database 55 and with ll OPEN ACCESS MAF above 5% in both 1KG EUR and 1KG EAS. Here, we excluded 46 metabolites whose estimated heritability was less than 1.0 3 10 À3 in either one of the populations to calculate accurate multi-population genetic correlation. 56 The significance threshold was set to p = 6.5 3 10 À4 for the jackknife test of r ge < 1 to correct for 77 multiple tests.

Meta-analysis with ToMMo and multi-population analysis
We performed a meta-analysis with another Japanese cohort (ToMMo) and a multi-population analysis with a fixed-effect model based on Stouffer's Z-score method. METAL software 39 was used for sample size weighting. We downloaded publicly available GWAS summaries of the Japanese 35 and Europeans. 4 For the European results, we converted the genome coordinates of hg18 to hg19 using the UCSC liftOver tool. 49 For the meta-analysis with ToMMo, we analyzed the variants detected in both studies. As for multi-population analysis, we examined the variants identified in the Japanese (either the current study or ToMMo) and the European study. The significance threshold was set to p = 4.1 3 10 À10 . The same definition was used for the locus and assignment of the gene. The detected locus was considered ''additional'' when the lead variant was located at least 500-kb away from any variants showing genome-wide significance in the association with the corresponding metabolite.

Functional annotation, enrichment analysis, and heritability estimates
To obtain functional annotation of variants, we used ANNOVAR (April 16, 2018, released 40 ) for the portion of the gene, amino acid substitution and frameshift/stopgain. For promoter and enhancer region annotation, we used a chromatin state database of nine cell types. 22 In each type of function, we conducted enrichment analysis for mQTL using the following steps. 1) Let N v be the count of the number of the lead (Table 1) and additional variants (Table S3) (in total, n = 75) in LD (r 2 > 0.8) with a variant annotated in the selected function. The 1KG EAS population was used to calculate LDs. 2) We randomly selected 75 variants from the chromosome and MAF-matched (MAF difference <0.025) variants in 1 KG EAS data and counted the number of variants in the same manner as in step 1. 3) Step 2 was repeated 1,000 times, the counted number labeled as N 1 . N 1000 . 4) Fold enrichment of the selected function type was estimated by a mean of N v / N 1 to N v /N 1000 , and the empirical p value was calculated by sorting counts Nv and N 1 to N 1000 .

Phenome-wide association study
We examined the associations of 60 lead (Table 1) and 15 additional (Table S3) variants with 58 clinical measurements, using summary statistics of a Japanese GWAS. 25 When the variant was absent from the dataset, we selected the one from 1 KG EAS which showed the strongest LD with the metabolite-associated variant. The significance threshold was set to p = 1.1 3 10 À5 to correct for 4,350 (75 3 58) multiple tests. To visualize the results, Z values were clustered by Euclidean distance.
The association between the number of drinking days per week and metabolite levels was tested for the same 4,888 samples by an ordinary linear regression model using the statsmodel package in Python (ver. 2.7.5). As for drinking habit data, we used the results of the questionnaire, ''How many days do you drink per week?''. Age, sex, and top ten PC-adjusted metabolite quantification values were used.
The results were validated using the data from 4,888 samples of the Nagahama Study. Association analyses were performed using the same methods as those for the metabolites.

Association of mQTL with disease
We used the NHGRI GWAS catalog (released September 24, 2021) 48 and converted chromosomal positions from hg38 to hg19 using the UCSC liftOver tool. 49 First, we extracted variant-trait associations that passed all of the following criteria: i) the variant is present in our own dataset, ii) with allelic direction information, and iii) the trait was categorized as a disease based on the definition of experimental factor ontology. 57 Next, we assigned the lead and the additional variants to the variant-trait association using the following two criteria: i) the variant is in high LD (r 2 > 0.8) with the trait-associated variant in 1 KG